MLA kv cache: fix split graph backend assignment when kv cache store on CPU #13648
base: master
Conversation
@jukofyork Please take a look at this change. Thanks!
@ggerganov Could you please take a look at this PR?
I think the change is good. I would like to see some reports on how much it affects performance on some CUDA configurations.
@ggerganov Thanks for the reply! I will try to find one and collect some data.
Sorry for the slow reply - I'm away until late tomorrow, but will try to look at it then.
Yeah, take your time on this one ~ I noticed your comment in the last PR that you are on leave :)
@ggerganov I updated the CUDA backend performance data in the PR description; it is slightly slower than the master branch. I think it depends on the CPU and GPU kernel implementations.
If using a CPU KV cache with MLA, I would suggest using …
Oh, sorry for the late reply @slaren, I just missed your comment. Yes, you are right, …
Background
A possible minor fix based on #12801, if I understand correctly.
With the new DS-V3 GGUF file for the reduced KV cache, if the user disables KV offload with `LLAMA_ARG_NO_KV_OFFLOAD=1`, the MLA computation is still assigned to the GPU backend (tested with the SYCL backend), which is unexpected in my understanding.
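As a side note for reproducing this outside the CLI/server, the setting that `LLAMA_ARG_NO_KV_OFFLOAD=1` toggles is exposed as `offload_kqv` in the C API. A minimal sketch, assuming a recent `llama.h` (the wrapper function name here is mine, not from this PR):

```cpp
// Minimal sketch, not part of this PR: keep the KV cache (and the KQV ops)
// on the CPU via the C API, the programmatic equivalent of the
// LLAMA_ARG_NO_KV_OFFLOAD=1 / --no-kv-offload CLI switch.
#include "llama.h"

static llama_context * init_ctx_with_cpu_kv_cache(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.offload_kqv = false; // do not offload the KV cache / KQV ops to the GPU
    // (llama_new_context_with_model() in older llama.cpp versions)
    return llama_init_from_model(model, cparams);
}
```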
Solution
The root cause is that the output node of the last `SV_absorb matmul` for MLA is not controlled by the environment variable and gets assigned to the GPU backend during graph splits (the expand-up stage). The node was added in #12801 for MLA with the reduced KV cache, and the existing setting here no longer applies because of the new node. This PR forces that node onto the CPU backend when the user sets `LLAMA_ARG_NO_KV_OFFLOAD=1`, so that the computation between the KV cache store and the attention output is assigned to the CPU.
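To make the mechanism concrete, ggml's scheduler lets the caller pin an individual graph node to a backend before the splits are expanded, which is the kind of hook this fix relies on. A rough sketch only, not the literal diff; the helper name, parameters, and node lookup are assumptions:

```cpp
// Sketch of the idea: when KV offload is disabled, pin the output node of the
// last MLA mul_mat to the CPU backend, so the graph splitter keeps the span
// between the KV-cache store and the attention output on the CPU.
#include "ggml-backend.h"

static void pin_mla_output_to_cpu(ggml_backend_sched_t sched,
                                  ggml_backend_t       backend_cpu,
                                  struct ggml_tensor * mla_out,   // output of the SV_absorb matmul
                                  bool                 offload_kqv) {
    if (!offload_kqv && mla_out != NULL) {
        // ask the scheduler to place this node on the CPU backend when it
        // assigns backends during graph splitting (expand-up stage)
        ggml_backend_sched_set_tensor_backend(sched, mla_out, backend_cpu);
    }
}
```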
Testing results (SYCL backend)
The model I used is the newly updated `DeepSeek-V3-0324`, `Q4_K_M` version: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/tree/main/Q4_K_M
Functionality-wise

`LLAMA_ARG_NO_KV_OFFLOAD=1` (force KV cache on CPU), before the change.

`LLAMA_ARG_NO_KV_OFFLOAD=1` (force KV cache on CPU), after the change.

Performance-wise
The speedup depends on the CPU/GPU kernel implementations for MLA, and profiling shows low efficiency in the first matmul in the SYCL backend, so the speedup numbers are not very meaningful. I put some performance numbers here for reference.
Prefill speedup: 7%
Decoding speedup: 40%
Testing results (CUDA backend, tested with a 4070)
The model I used is the newly updated `DeepSeek-V3-0324`, `Q4_K_M` version: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/tree/main/Q4_K_M
Functionality wise
LLAMA_ARG_NO_KV_OFFLOAD=1
Force KV cache on CPU. Before the change.LLAMA_ARG_NO_KV_OFFLOAD=1
Force KV cache on CPU. After the change.Performance-wise
Prefill: 94% of master branch.
Decoding: 98.6% of master branch.